PART 1 :                            Dunnhumby, The Complete Journey Analysis

INTRODUCTION:

The Dunnhumby Dataset consists of anonymized shopping data from 2,500 households and spanning over two years. The data is well formatted and requires very little processing. It contains generic information about the household and its composition, as well as consumption information in groceries, broad description of the products, and details about the potential discounts coupons that were redeemed.
In this analysis, we focus mainly on food products since we want to establish a link with openfoodfacts dataset.We hence use only three files of Dunnhumby :

  • hh_demographic.csv
  • product.csv
  • transaction_data.csv

We start studying each feature's statistics in order to identify the most relevant and important ones. Then we merge the three files in order to have one dataframe containing all the significant features. Then we analyse the correlations matrices between attributes of the dataset in order to find potential links.

For easier readability:

  • The notebook is divided into 3 parts each one focusing on a part of the analysis
  • Comments explain what has been done in the next cells up until the following comment. They are in italic.
In [1]:
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
In [2]:
def plotDistributionDataset(df):
    n_row, n_col = df.shape
    columns = list(df.columns)
    plt.figure(num = None, figsize = (5*3,5*np.ceil(n_col/3)))
    for i in range(n_col):
        plt.subplot(int(np.ceil(n_col/3)),3,i+1)
        subdf = df.iloc[:,i]
        if (not np.issubdtype(type(subdf.iloc[0]), np.number)):
            valueCounts = subdf.value_counts()
            valueCounts.sort_index(ascending=True).plot.bar()
        else:
            if(subdf.dtype == 'int64'):
                subdf.hist(bins = subdf.max()-subdf.min())
            else :
                subdf.hist()
                
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columns[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()
In [3]:
def correct_indecome(x):
    if(x == "Under 15K"):
        c = "-15K"
    elif(len(x) == 6):
        c = '-'.join([x.split('-')[0].zfill(3),x.split('-')[1].zfill(4)])
    else:
        c = x
    return c
In [4]:
def correct_marital_status(x):
    if(x == "A"):
        c = "Married"
    elif(x == "B"):
        c = "Single"
    elif(x == "U"):
        c = "Unknown"
    return c
In [5]:
def replace_unknowns(hh_comp, marital_status, hh_size):
    if(hh_comp == '2 Adults Kids' and marital_status == 'Unknown'):
        marital_status = 'Married'
    elif(hh_comp == '1 Adult Kids' and marital_status == 'Unknown'):
        marital_status = 'Single' 
    elif((hh_comp == 'Single Male' or hh_comp == 'Single Female') and marital_status == 'Unknown'):
        marital_status = 'Single' 
    elif(hh_size == '1' and marital_status == 'Unknown'):
        marital_status = 'Single' 
    else:
        marital_status = marital_status
    return marital_status
In [6]:
def ploting_stacked_plot(count,elements,list_specifics,specific,text):  
    count = np.array(count)
    c = pd.DataFrame(np.transpose(count),columns = tuple(list_specifics),index =tuple(set(elements)))
    sns.set()
    if len(elements) <= 12:
        sns.set_palette("Paired", 12)
    else:
        sns.set_palette("hls", 20)
    c.T.plot(kind = 'bar',stacked = True,figsize = (15,10))
    plt.ylabel('Purchases Counts per Food')
    plt.xlabel(specific)
    plt.title(['Amount of purchases of '+text+' foods along '+specific])
    plt.show()
In [7]:
def ploting_specific_general(dataset, specific):
    
    list_specifics = dataset.loc[:,specific].value_counts().sort_index(ascending=True).index
    list_food = []
    for i in range(len(list_specifics)):
        test = dataset[dataset.loc[:,specific].apply(lambda x: x == list_specifics[i])]
        list_food.append(list(test.SUB_COMMODITY_DESC.value_counts()[:10].index))
        
    count = 0
    elements3 = []
    elements2 = []
    for l in list_food:
        elements1 = []
        for element in l:
            inter = list_food.copy()
            inter.remove(l)
            list_final = [x for xs in inter for x in xs]
            if(element not in list_final or list_final.count(element) <int(0.50*len(list_specifics))):
                elements1.append(element)
                elements3.append(element)
            else:
                elements2.append(element)
        count += 1
        
    count1 = []
    count2 = []
    for list_specific in dataset.loc[:,specific].unique():
        for food in set(elements2):
            count2.append(int(dataset[(dataset.loc[:,specific] == list_specific) & (dataset.SUB_COMMODITY_DESC == food)].shape[0] / \
                              dataset[(dataset.loc[:,specific] == list_specific)].shape[0] * 100))

        count1.append(count2)
        count2 = []
    
    ploting_stacked_plot(count1,elements2,list_specifics,specific,'most commun') 
    
    count1 = []
    count2 = []
    for list_specific in dataset.loc[:,specific].unique():
        for food in set(elements3):
            count2.append(int(dataset[(dataset.loc[:,specific] == list_specific) & (dataset.SUB_COMMODITY_DESC == food)].shape[0] / \
                              dataset[(dataset.loc[:,specific] == list_specific)].shape[0]*100))

        count1.append(count2)
        count2 = []
        
    elements3 = list(set(elements3))
    count1_array = np.array(count1)
    elements = []
    count1 = np.array([])
    index_col = []
    for i in range(len(count1_array[0])):
        if(np.count_nonzero(count1_array[:,i]) != 0 and np.count_nonzero(count1_array[:,i]) <= 2):
            elements.append(elements3[i])
        else:
            index_col.append(i)
            
    count1 = np.delete(count1_array,index_col, 1)
    ploting_stacked_plot(count1,elements,list_specifics,specific,'specific')

A. Feature Analysis:

In this part, we load the three subdatasets we decided to use and conduct some preliminary study on each one of them.

A1. Demographic Analysis:

_Comments:
The hhdemographic subdataset contains demographic information on 801 households. For each one, it offers the age range, the marital status, the income range, type of ownership, household composition, size, and number of kids. This dataset is not complete since the Dunnhumby study was done on 2500 households, while there is demographic information on 801 only.

In [8]:
demographic = pd.read_csv('dataset/hh_demographic.csv', delimiter=',')
print(f'The number of rows and columns in the dataframe is respectively {demographic.shape[0]} and {demographic.shape[1]}')
demographic.head()
The number of rows and columns in the dataframe is respectively 801 and 8
Out[8]:
AGE_DESC MARITAL_STATUS_CODE INCOME_DESC HOMEOWNER_DESC HH_COMP_DESC HOUSEHOLD_SIZE_DESC KID_CATEGORY_DESC household_key
0 65+ A 35-49K Homeowner 2 Adults No Kids 2 None/Unknown 1
1 45-54 A 50-74K Homeowner 2 Adults No Kids 2 None/Unknown 7
2 25-34 U 25-34K Unknown 2 Adults Kids 3 1 8
3 25-34 U 75-99K Homeowner 2 Adults Kids 4 2 13
4 45-54 B 50-74K Homeowner Single Female 1 None/Unknown 16

Comments:
We can see that for the demgraphic subdataset, there is no NaN values so there is no cleaning to do on this part.

In [9]:
demographic.isna().all()
Out[9]:
AGE_DESC               False
MARITAL_STATUS_CODE    False
INCOME_DESC            False
HOMEOWNER_DESC         False
HH_COMP_DESC           False
HOUSEHOLD_SIZE_DESC    False
KID_CATEGORY_DESC      False
household_key          False
dtype: bool

Comments:
To have a better visualization of the plots below, we preprocess the class names on the Income feature so therefore we have it in ascending order.

In [10]:
demographic.INCOME_DESC = demographic.INCOME_DESC.apply(correct_indecome)
demographic.MARITAL_STATUS_CODE = demographic.MARITAL_STATUS_CODE.apply(correct_marital_status)

Comments:
We use our custom-made function in order to plot the distributions of each feature of the demographic dataset.

  • AGE : The age feature is made of 6 classes ranging from 19 years old to +65. The most common class (the one with the maximum value counts) is '45-54', with a value count of 288. It is also easily noticeable that the distribution between classes is not uniform.
  • Marital Status : This feature is made of 3 classes: "Married, Single and Unknown". The most commonn classes are 'Married' and 'Unknown' , with a value count of 340 and 344, respectively (The Unknown values will addressed later). For this feature also, the distribution is not uniform between classes.
  • Income : Made of 12 classes, this feature ranges from an income of under of 15K US dollars to +250k. The most common class is '50-74K', with a value count of 192. For this feature also, the distribution is not uniform between classes.
  • Homeowner: This one specifies the type of ownership of the household. This feature identifies if the individuals living in this house are homeowners or renters. It is made of 5 classes: 'Homeowner', 'Probable Owner', 'Probable Renter', 'Renter' an 'Unknown'. The most common type of ownership is 'Homeowner.
  • Household composition : This feature contains 6 classes detailing the composition of housholds. The most common class is '2 adults no kids'. Since this feature contains much less 'Unknown' values thann 'Marital Status', and could provide information about this latter, we intend to use this feature to complete the missing values of marital status.
  • Household Size: This feature gives the number of individuals living in a household. It is made of 5 classes ranging from 1 to +5 individuals. This attribute can also be used as an indicator for completing the marital status feature.
  • Kid Category: This attribute is the number of kids per household. It has 4 classes, '1' kid, '2' kids, '+3' kids as well as 'Unknown'. This latter is the most common one.
  • Household Key : we notice that the dataset contains demographic information for a portion of households, meaning that the demogrphic is not available for all households.
In [11]:
plotDistributionDataset(demographic)

_Comments: As we said previously, we use Household Composition and Household Size to replace the unknown values of Marital Status. Indeed, since these information are conceptually linked, we could infer from one the value of the other. We implemented a rule based function called replace_unknowns() that assigns a class to the marital status depending on the values of household composition and size. We consider 2 adults with kids as married, 1 adult with kids as single, single male/female as single, and if size is 1 we also assign single.
We can see that have been able to reduce the number of Unknown values from 344 to 95. We won't drop the remaining Unknowns since we intend to use mainly income, age and composition information, and we don't want to lose those rows._

In [12]:
demographic.MARITAL_STATUS_CODE = demographic.apply(lambda row: replace_unknowns(row['HH_COMP_DESC'],
                                                    row['MARITAL_STATUS_CODE'], row['HOUSEHOLD_SIZE_DESC']),axis=1)
In [13]:
demographic.MARITAL_STATUS_CODE.value_counts()
Out[13]:
Married    383
Single     323
Unknown     95
Name: MARITAL_STATUS_CODE, dtype: int64

Comments:
Now we try to take a look at the dependancy between our features. We start by one hot encoding our features since they are categorical.Then plot a correlation heatmap to compare features. We observe the following:

  • There is a relatively high correlation between marital status and the household size, which is expectable since mainly married couples would have a household size of more than 2. We also observe that households labeled single tend to have a size of 1.
  • There is a high correlation between marital status and homeownership status. We can see that married couples tend to possess the ownership of their house, while singles are mostly renters or with unknow ownership status.
  • Marital status shows also some correlation with the income. Married couples tend to have a high income ranging from 75K to +250K, while singles tend to have a lower income ranging from under 15K to 34K.
  • We could also observe a slight correlation between the age feature and the income. Household with le lowest age range 19-24 yo seem to be the ones with the highest income +250K but also with the lowest one under 15K. While the other classes of age up until 64 yo seem to have an income ranging between 50K and 199K. A higher age seems to be correlated with a lower income of 35K-49K.
In [14]:
one_hot_demographic = pd.get_dummies(demographic, columns=['AGE_DESC', 'MARITAL_STATUS_CODE', 'INCOME_DESC', 'HOMEOWNER_DESC', 
                                                           'HH_COMP_DESC', 'HOUSEHOLD_SIZE_DESC', 'KID_CATEGORY_DESC'])
In [15]:
corr = one_hot_demographic.corr(method = 'pearson')
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(15,15))         # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

A2. Transactions Analysis:

Comments:
The transaction data subdataset contains all the purchases made by households within the study.This table is transaction centered, it essentially provides for each transaction the ID of the household which bought it, the product ID, the quantity bought, the sales value which is the margin gained by the retailer, as well as the transaction time, day and week.

In [16]:
transaction = pd.read_csv('dataset/transaction_data.csv', delimiter=',')
print(f'The number of rows and columns in the dataframe is respectively {transaction.shape[0]} and {transaction.shape[1]}')
transaction.head()
The number of rows and columns in the dataframe is respectively 2595732 and 12
Out[16]:
household_key BASKET_ID DAY PRODUCT_ID QUANTITY SALES_VALUE STORE_ID RETAIL_DISC TRANS_TIME WEEK_NO COUPON_DISC COUPON_MATCH_DISC
0 2375 26984851472 1 1004906 1 1.39 364 -0.60 1631 1 0.0 0.0
1 2375 26984851472 1 1033142 1 0.82 364 0.00 1631 1 0.0 0.0
2 2375 26984851472 1 1036325 1 0.99 364 -0.30 1631 1 0.0 0.0
3 2375 26984851472 1 1082185 1 1.21 364 0.00 1631 1 0.0 0.0
4 2375 26984851472 1 8160430 1 1.50 364 -0.39 1631 1 0.0 0.0

Comments:
Since we saw in the demographic analysis that the first subdataset doesn't include all the household IDs. It is pertinant to keep the transactions made by households with provided demographic data only, thus the following.

In [17]:
demographic_households = demographic.household_key.unique()
transaction = transaction[transaction.household_key.apply(lambda x: x in demographic_households)]
print(f'The number of rows and columns in the dataframe is now respectively {transaction.shape[0]} and {transaction.shape[1]}')
The number of rows and columns in the dataframe is now respectively 1427303 and 12

Comments:
This dataset details all the transactions made by each household. Therefore, it gives information on their consumption over time. Using the custom function to plot the distributions of each feature, we get the following plots and observations:

  • _Week_no : This feature specifies the week during which the transaction has been made. It ranges from 0 to 102. We can see that the consumption starts low in the first weeks, and increases rapidly to reach a trend plateau of approximately 28000 transactions per week. Also we notice that in the last week, consumption explodes to nearly 50000 transctions. To keep the data uniform and have a fixed consumption over the weeks, we intend to drop the transactions made during the first and last weeks._
  • Transaction Time : This feature provides the time during the day at which the transaction has been made. It ranges from 0 to 2400, corresponding to the 24 hours of the day. We can see that most of the transactions are made during the day between 10:00 and 22:00, with a maximum of consumption at 18:00.
  • _Day: This feature specifies in which day the transaction has been made. It just gives a more precise date than the weekno attribute. As in this latter, we can see that there is a low consumption during the first days increasing to reach a plateau of approximately 4000 transactions per day. Since this feature represents a higher time resolution, we will drop the transactions made in the first and last few days to make the data more uniform over time. _The custom function does not give correct plots for the features QUANTITY and SALES_VALUE, because the distributions of these attributes are extremely skewed, this is why we plot them separatly. We obseve the following:_
  • Quantity : This feature gives the quantity of the product bought at each transaction. It is heavy tailed with value that reach more than 89000. However most of the values are between 0 and 10.
  • Sales value : This attribute reprensents the margin gained by the retailer in the transaction. The price of the product could be derived from this value, which we'll do later. It is also highly skewed but most values range between 0 and 20.
In [18]:
plotDistributionDataset(transaction.loc[:,['WEEK_NO','TRANS_TIME','DAY']])
In [19]:
plt.figure(num = None, figsize = (15,10))
plt.subplot(2,2,1)
ax1 = transaction.SALES_VALUE.hist(bins=1000)
ax1.set_xlim(0,20)
plt.subplot(2,2,2)
ax2 = transaction.QUANTITY.hist(bins=100000)
ax2.set_xlim(0,20)
ax1.title.set_text('SALES_VALUE')
ax2.title.set_text('QUANTITY')
plt.show()

Comments:
We plot a heatmap of the correlations between the transaction features, we observe the following:

  • _Most of the high correlations we observe are expected for example between 'Day' and 'WeekNo', 'Quantity' and 'Sales value'.
  • We observe a slightly high correlation between the product ID and Quantity, as well as with the sales value.
In [20]:
corr = transaction.corr(method = 'pearson')
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(10,10))         # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

_Comments:
For further use, we divide the Sales_value feature into two different columns, one where we keep the actual sales value, and the other where we sum all the sales values of transactions made by a household during the whole study. It results in an indicator of the total money spent by the household during the period of the study._

In [21]:
behavior_consumer = pd.DataFrame(transaction.groupby(['household_key']).sum().SALES_VALUE)
behavior_consumer['SALES_VALUE_TOTAL'] = behavior_consumer['SALES_VALUE']
behavior_consumer = behavior_consumer.drop(columns=['SALES_VALUE'])
transaction = pd.merge(transaction, behavior_consumer, on='household_key')
transaction.head()
Out[21]:
household_key BASKET_ID DAY PRODUCT_ID QUANTITY SALES_VALUE STORE_ID RETAIL_DISC TRANS_TIME WEEK_NO COUPON_DISC COUPON_MATCH_DISC SALES_VALUE_TOTAL
0 1364 26984896261 1 842930 1 2.19 31742 0.00 1520 1 0.0 0.0 2466.05
1 1364 26984896261 1 897044 1 2.99 31742 -0.40 1520 1 0.0 0.0 2466.05
2 1364 26984896261 1 920955 1 3.09 31742 0.00 1520 1 0.0 0.0 2466.05
3 1364 26984896261 1 937406 1 2.50 31742 -0.99 1520 1 0.0 0.0 2466.05
4 1364 26984896261 1 981760 1 0.60 31742 -0.79 1520 1 0.0 0.0 2466.05

A3. Products Analysis:

_Comments:
The products subdataset contains the list of products purchased by households during the study. It gives information about each product sold as type of product,national or private label and brand identifier. The main goal of the project is to analyse people's food consumption, we limited our analysis on food related product ('listfood'). We plot the distributions of each feature, we get the following plots and observations:

  • DEPARTMENT : Groups similar products together, the majority of product sold where in the Grocery departement. It has 10 unique classes.
  • BRAND : Indicates Private or National label brand, the large majority of product sold came from national brands.
  • _COMMODITY_DESC : Groups similar products together at a lower level. It provides 183 types of commodities (Soft drinks, bag snacks, frozen meat ...)_
  • _SUB_COMMODITY_DESC : Groups similar products together at the lowest level. It has 1319 classes (potato chips, spices and seasoning, beermalt liquors ...)_
In [22]:
product = pd.read_csv('dataset/product.csv', delimiter=',')
print(f'There are {product.shape[0]} rows and {product.shape[1]} columns')
product.head()
There are 92353 rows and 7 columns
Out[22]:
PRODUCT_ID MANUFACTURER DEPARTMENT BRAND COMMODITY_DESC SUB_COMMODITY_DESC CURR_SIZE_OF_PRODUCT
0 25671 2 GROCERY National FRZN ICE ICE - CRUSHED/CUBED 22 LB
1 26081 2 MISC. TRANS. National NO COMMODITY DESCRIPTION NO SUBCOMMODITY DESCRIPTION
2 26093 69 PASTRY Private BREAD BREAD:ITALIAN/FRENCH
3 26190 69 GROCERY Private FRUIT - SHELF STABLE APPLE SAUCE 50 OZ
4 26355 69 GROCERY Private COOKIES/CONES SPECIALTY COOKIES 14 OZ

_Comments:
We take a look at the distributions of the features Department and Brand since they have a limisted number of classes. We notice that most of the products correspond to the class 'Grocery', with a value count of nearly 40000, while the other classes contain less than 5000 products each. Also, we observe that more than 40000 products are labeled as 'National' while only 1000 are labeled as private.
Since the subdataset contains not only food related products we decided to drop some of them based on the department class. We only keep the departments mentioned in the list below called list_food._

In [23]:
list_food = ['GROCERY','PASTERY','PRODUCE','NUTRITION','MEAT','FROZEN GROCERY','SALAD BAR','SEAFOOD','SPIRITS','SEAFOOD-PCKGD','MEAT-PCKG','DELI'] 
food_related_products = product[product.DEPARTMENT.apply(lambda x: x in list_food)]
print(f'There are now {food_related_products.shape[0]} rows and {food_related_products.shape[1]} columns')

plt.figure(num = None, figsize = (15,10))
plt.subplot(2,2,1)
ax1 = food_related_products.DEPARTMENT.value_counts().plot.bar()
plt.subplot(2,2,2)
ax2 = food_related_products.BRAND.value_counts().plot.bar()
ax1.title.set_text('DEPARTMENT')
ax2.title.set_text('BRAND')
plt.show()
There are now 51331 rows and 7 columns

B. Merged Dataset Analysis:

B1. Merging subsets:

Comments:
In order to simplify the analysis of the dataset, we decided to merge the 3 subdatasets. We first drop the unwanted features from the subsets, then we merge the food related products and transactions subsets based on the product id. After that, we merge this new intersection dataset with the demographic subset based on the Houselhold key. Both mergers are made as an 'inner' join of subsets because we only want to keep the common inner features.
The new dataset contains all the features. It is household centric, for each household we have the information on the consumed products as well as on the purchase which makes the analysis simpler.

In [24]:
food_related_products.drop(columns = ['BRAND','CURR_SIZE_OF_PRODUCT','MANUFACTURER'],inplace = True)
consumution = transaction.drop(columns = ['STORE_ID','RETAIL_DISC','TRANS_TIME','COUPON_DISC','COUPON_MATCH_DISC'])
inter = pd.merge(food_related_products,consumution,on =  'PRODUCT_ID')
full_data_set = pd.merge(inter,demographic,on = 'household_key')
behavior_consummer = full_data_set.groupby(['household_key']).sum().SALES_VALUE
full_data_set.head()
Out[24]:
PRODUCT_ID DEPARTMENT COMMODITY_DESC SUB_COMMODITY_DESC household_key BASKET_ID DAY QUANTITY SALES_VALUE WEEK_NO SALES_VALUE_TOTAL AGE_DESC MARITAL_STATUS_CODE INCOME_DESC HOMEOWNER_DESC HH_COMP_DESC HOUSEHOLD_SIZE_DESC KID_CATEGORY_DESC
0 25671 GROCERY FRZN ICE ICE - CRUSHED/CUBED 1228 29046618323 157 1 3.49 23 14503.89 45-54 Single 100-124K Unknown Single Female 1 None/Unknown
1 43266 GROCERY FLUID MILK PRODUCTS FLUID MILK WHITE ONLY 1228 29513525602 187 1 0.89 27 14503.89 45-54 Single 100-124K Unknown Single Female 1 None/Unknown
2 68543 GROCERY DOG FOODS CAN DOGFD GOURMET/SUPER PREM ( 1228 31841925002 325 1 0.69 47 14503.89 45-54 Single 100-124K Unknown Single Female 1 None/Unknown
3 71759 GROCERY FROZEN PIE/DESSERTS FRZN WHIPPED TOPPING 1228 30589276083 235 1 0.69 34 14503.89 45-54 Single 100-124K Unknown Single Female 1 None/Unknown
4 73580 GROCERY ICE CREAM/MILK/SHERBTS QUARTS 1228 28394445201 117 1 4.99 17 14503.89 45-54 Single 100-124K Unknown Single Female 1 None/Unknown

B2. Relevant correlations analysis:

Comments:
We try to take a look at relevant correlations between the features that have been bought together in the new dataset. We decide to first evaluate the corrlation between income and age on the consumed products as well as the total spent money. We start by one hot encoding the feature Income then we plot a heatmap of the correlations, we observe the following:

  • _Most of the high correlations are between high incomes and total money spent during the study ('sales_valuey') .
  • _We observe a slightly negative correlation between low incomes and total money spent during the study ('sales_valuey').
In [25]:
full_data_set_test = pd.merge(full_data_set,behavior_consummer,on = 'household_key')
one_hot_income = pd.get_dummies(full_data_set_test, columns=["INCOME_DESC"])
corr = one_hot_income.corr(method = 'pearson')

corr.SALES_VALUE_y.sort_values(ascending = False)
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(10,10))         # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

Comments:
Now we do the same but with the AGE feature, in orther to evaluate correlations between age and consumed products and their sales value, we observe that:

  • There is a relatively high correlation between ages in the range 35-54 yo with the total sales value (total money spent).
  • The other age categories are slightly negatively correlated with the total sales value.
In [26]:
full_data_set_test = pd.merge(full_data_set,behavior_consummer,on = 'household_key')
one_hot_income = pd.get_dummies(full_data_set_test, columns=["AGE_DESC"])
corr = one_hot_income.corr(method = 'pearson')

corr.SALES_VALUE_y.sort_values(ascending = False)
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(10,10))         # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

B3. Influence of INCOME and AGE on food consumption:

Comments:
From the correlations previously computed, we noticed that there is a slight correlation between the age and the income of the household on their choice of food products depending on their sales value. We want to inspect more this correlation by looking at which foods are common to most of income classes and age categorie, and which ones are specific to each class. This would help us analyse the disparity of food consumption between social classes with the income and age as social indicators.

Comments:

For the income:

  • We computed the ten most consumed products for each category of income and then classified them as ‘common foods’ for all categories or ‘specific foods’.

In order to classify them as common foods, the product needs to be consumed by at least half of all income categories. Moreover, to classify them as specific foods, the product needs to be consumed by at most two different income categories.

  • Results for common foods: From the plot, we can say that every class consumed 6 items (‘bananas’,’ yogurt ’,’ bread ’,’ milk ’,’ soft drinks’, ‘cheese’). This six items are considered as commodities necessary for a balanced diet. Furthermore, this items are consumed in almost the same quantity for all the categories.

  • Results for specific food products: the results are more significant in this case, lower income categories consumed mainly base products. And higher incomes categories consumed more exotic food and healthy food such as strawberries, soup, premium food, 100% fruit juice.

In [27]:
ploting_specific_general(full_data_set,'INCOME_DESC')

Comments:
For the age:

  • We computed the ten most consumed products for each category of age and then classified them as ‘common foods’ for all categories or ‘specific foods’.

In order to classify them as common foods, the product needs to be consumed by at least half of all age categories. Moreover, to classify them as specific foods, the product needs to be consumed by at most two different age categories.

  • Results for common foods: From the plot, we can say that every class consumed 6 items (‘bananas’,’ yogurt ’,’ bread ’,’ milk ’,’ soft drinks’, ’cheese’). This six items are considered as commodities necessary for a balanced diet. Furthermore, this items are consumed in almost the same quantity for all the categories. We also have a new item that is ‘Potato chips’, that is mainly consumed by this two categories of age : ’19-24’,’25-34’.

  • Results for specific food products: the results are much more interesting than before, each age category, consumes different products :’45-54’ consumes as necessary ‘beer’, the'50-64’ categori consumes ‘Kids cereal’ meaning that they have children’s but, there was no visible correlation between the this category of age and kids, they also consumes ‘Economy dinners’.Finaly,the ‘65+‘ category are healthier and consumed more healthier food 100% fruit juice.

In [28]:
ploting_specific_general(full_data_set,'AGE_DESC')

Comments:
For the marital status,We observe the same status as before.

In [29]:
ploting_specific_general(full_data_set,'MARITAL_STATUS_CODE')

C. Further analysis : top products

Comments:
One thing the verify is if the different households buy always different items or they have habits when buying their food. From the histogram below, we can see that 63% of product are reordered and 35% of product are not.

In [30]:
reordered = pd.DataFrame(full_data_set.groupby(['household_key','SUB_COMMODITY_DESC']).count()[['DAY']].DAY.apply(lambda x: True if x>1 else False))
reordered.DAY.value_counts()
Out[30]:
True     134239
False     75892
Name: DAY, dtype: int64
In [31]:
plt.figure(num = None, figsize = (10,10))
plt.subplot(1,1,1)
ax = reordered.DAY.value_counts().plot.bar()
ax.title.set_text('Reoredered products')
plt.show()

D. Conclusions:

The Dunhumby dataset gives us valuable demographic information but also informations on how people consume and what they consume. With this dataset we were able to evaluate the social status of each household,and correlate this information with their food consumption.

=====================================================================================

PART 2 :                                        Open Food Fact Analysis

Introduction :

The Open Food Fact is a collaborative dataset of food products from arount the world, with ingredients, allergens, nutrition facts and all information we can find on product labels. Open Food Facts Database is a 2.1GB CSV file with tab separators. We will use the folowing features for our analysis :

  • Nutrition: features related to nutriements
  • Ingredients: features about composition of the food
  • Tag: information about product_name, countries etc...
In [32]:
import math
import pickle
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from matplotlib.ticker import MaxNLocator
from wordcloud import WordCloud, STOPWORDS
from scipy import stats
from sklearn import manifold, datasets
from adjustText import adjust_text
In [33]:
def highlight(f, n_labels, size=(10,10)):
    x, y = Y.T
    names = df_nutrition.product_name[X.index]
    names_list = names.tolist()
    colors = names.apply(lambda s: 'red' if f(s) else 'lightgray').tolist()
    
    fig, ax = plt.subplots(figsize=size)

    sc = plt.scatter(x, y, c=colors, s=5)
    
    texts = [plt.text(x[i], y[i], names_list[i], fontsize=10) for i in np.random.choice(len(X), n_labels)]
    adjust_text(texts)

    plt.show()
In [34]:
def highlight2(cols, title, size=(10,10)):
    x, y = Y.T
    
    def exists_col(r):
        for c in cols:
            if r[c] == 1:
                return True
        return False
    
    colors = matrix.loc[X.index.intersection(matrix.index)].apply(lambda r: 'red' if exists_col(r) else 'lightgray', axis=1).tolist()
    
    fig, ax = plt.subplots(figsize=size)

    plt.axis('off')
    
    sc = plt.scatter(x, y, c=colors, s=5)
    
    texts = [plt.text(x[i], y[i], names_list[i], fontsize=7) for i in np.random.choice(len(X), 0)]
    adjust_text(texts)

    plt.title(title)
    
    plt.show()
In [35]:
openfoodfacts = pd.read_csv('dataset/dataset_open_food.csv','\t')
print('Shape of the dataset:',openfoodfacts.shape)

openfoodfacts.head() 
#--> for the moment
Shape of the dataset: (1005487, 175)
Out[35]:
code url creator created_t created_datetime last_modified_t last_modified_datetime product_name generic_name quantity ... carbon-footprint-from-meat-or-fish_100g nutrition-score-fr_100g nutrition-score-uk_100g glycemic-index_100g water-hardness_100g choline_100g phylloquinone_100g beta-glucan_100g inositol_100g carnitine_100g
0 0000000000017 http://world-en.openfoodfacts.org/product/0000... kiliweb 1529059080 2018-06-15T10:38:00Z 1561463718 2019-06-25T11:55:18Z Vitória crackers NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0000000000031 http://world-en.openfoodfacts.org/product/0000... isagoofy 1539464774 2018-10-13T21:06:14Z 1539464817 2018-10-13T21:06:57Z Cacao NaN 130 g ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 00000000001111111111 http://world-en.openfoodfacts.org/product/0000... openfoodfacts-contributors 1560020173 2019-06-08T18:56:13Z 1560020173 2019-06-08T18:56:13Z Sfiudwx NaN dgesc ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 0000000000123 http://world-en.openfoodfacts.org/product/0000... kiliweb 1535737982 2018-08-31T17:53:02Z 1535737986 2018-08-31T17:53:06Z Sauce Sweety chili 0% NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0000000000178 http://world-en.openfoodfacts.org/product/0000... kiliweb 1542456332 2018-11-17T12:05:32Z 1542456333 2018-11-17T12:05:33Z Mini coco NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 175 columns

A. Cleaning the dataset:

In [36]:
nan_column = openfoodfacts.isnull().sum(axis=0).sort_values(ascending=False)
plt.figure(figsize=(10,5))
plt.title('Distributions of NAN over our columns')
nan_column.hist(bins='auto')
plt.xlabel('Number of NaN')
plt.ylabel('Number of columns')
plt.tight_layout()
plt.show()
In [37]:
plt.figure(figsize=(10, 50))
plt.rcParams['axes.facecolor'] = 'white'
plt.rc('grid')
(openfoodfacts.isnull().mean(axis=0)*100).plot.barh()
plt.xlim(xmax=100)
plt.title("Missing values rate")
plt.xlabel("percentage")
plt.show();

Comments:
_Furthermore, there are some missing values in the dataset. Actually some of these columns are nearly empty, for example the composition columns ("cocoa_100g", "zinc100g", etc...). But this can easily be explained by the fact that it is impossible to put on every product's label its complete composition.

Moreover, there is the possibility to discard features with more than 60% missing values (for instance), because these features are inconsistent.
However, we do not want to do that because in the following parts, we are going to discuss some specific aspects of our food activity so we will only take the parts from this dataset (the parts that are of interest of course).

Comments:
We focused on most part our analysis on American products,this is a decision we made in order to complete the Dunhumby dataset which is the US.

_Now, We first focus mainly on the nutrition table below, this will help to correct mistakes in the dataset. By comparing the energy100g and calculated energy from carbohydrates, proteins and fat, we can delet some wrong entries. If the sum of element in 100g of product is more than 100g this entry is wrong and can be deleted.

In [38]:
nutrition_table_cols = ["product_name", "ingredients_text", "allergens","energy_100g",
                        "fat_100g",
                        "carbohydrates_100g",
                        "sugars_100g",
                        "proteins_100g",
                        "salt_100g",
                        "allergens",
                        "ingredients_from_palm_oil",
                       "additives","countries_en",
                       "nutrition_grade_fr","additives_tags"]
In [39]:
df_nutrition = openfoodfacts[nutrition_table_cols].copy()
interest = df_nutrition.product_name.dropna().index.intersection(df_nutrition.ingredients_text.dropna().index)
interest = interest.intersection(df_nutrition[df_nutrition.countries_en == "United States"].index)
df_nutrition = df_nutrition.iloc[interest]
df_nutrition.head()
Out[39]:
product_name ingredients_text allergens energy_100g fat_100g carbohydrates_100g sugars_100g proteins_100g salt_100g allergens ingredients_from_palm_oil additives countries_en nutrition_grade_fr additives_tags
42 Banana Chips Sweetened (Whole) Bananas, vegetable oil (coconut oil, corn oil ... NaN 2243.0 28.57 64.29 14.29 3.57 0.00000 NaN NaN NaN United States NaN NaN
43 Peanuts Peanuts, wheat flour, sugar, rice flour, tapio... NaN 1941.0 17.86 60.71 17.86 17.86 0.63500 NaN NaN NaN United States NaN NaN
92 Organic Salted Nut Mix Organic hazelnuts, organic cashews, organic wa... NaN 2540.0 57.14 17.86 3.57 17.86 1.22428 NaN NaN NaN United States NaN NaN
93 Organic Polenta Organic polenta NaN 1552.0 1.43 77.14 NaN 8.57 NaN NaN NaN NaN United States NaN NaN
94 Breadshop Honey Gone Nuts Granola Rolled oats, grape concentrate, expeller press... NaN 1933.0 18.27 63.46 11.54 13.46 NaN NaN NaN NaN United States NaN NaN
In [40]:
df_nutrition["sum_elements"] = df_nutrition.fat_100g + df_nutrition.carbohydrates_100g +\
                            df_nutrition.proteins_100g

df_nutrition["sum_elements"] = round(df_nutrition.sum_elements)

df_nutrition["other_carbs"] = df_nutrition.carbohydrates_100g - df_nutrition.sugars_100g

df_nutrition["reconstructed_energy"] = df_nutrition.fat_100g * 37 + \
                                            (df_nutrition.proteins_100g + df_nutrition.carbohydrates_100g)* 17
df_nutrition.head()
Out[40]:
product_name ingredients_text allergens energy_100g fat_100g carbohydrates_100g sugars_100g proteins_100g salt_100g allergens ingredients_from_palm_oil additives countries_en nutrition_grade_fr additives_tags sum_elements other_carbs reconstructed_energy
42 Banana Chips Sweetened (Whole) Bananas, vegetable oil (coconut oil, corn oil ... NaN 2243.0 28.57 64.29 14.29 3.57 0.00000 NaN NaN NaN United States NaN NaN 96.0 50.00 2210.71
43 Peanuts Peanuts, wheat flour, sugar, rice flour, tapio... NaN 1941.0 17.86 60.71 17.86 17.86 0.63500 NaN NaN NaN United States NaN NaN 96.0 42.85 1996.51
92 Organic Salted Nut Mix Organic hazelnuts, organic cashews, organic wa... NaN 2540.0 57.14 17.86 3.57 17.86 1.22428 NaN NaN NaN United States NaN NaN 93.0 14.29 2721.42
93 Organic Polenta Organic polenta NaN 1552.0 1.43 77.14 NaN 8.57 NaN NaN NaN NaN United States NaN NaN 87.0 NaN 1509.98
94 Breadshop Honey Gone Nuts Granola Rolled oats, grape concentrate, expeller press... NaN 1933.0 18.27 63.46 11.54 13.46 NaN NaN NaN NaN United States NaN NaN 95.0 51.92 1983.63
In [41]:
print(df_nutrition.reconstructed_energy.describe())
print('='*40)
print(df_nutrition.energy_100g.describe())
print('='*40)
print(df_nutrition.sum_elements.describe())
count    169282.000000
mean       1146.580820
std         822.046523
min      -13600.000000
25%         354.985000
50%        1153.620000
75%        1700.000000
max       12321.000000
Name: reconstructed_energy, dtype: float64
========================================
count    170234.000000
mean       1131.014555
std         980.842778
min           0.000000
25%         360.000000
50%        1142.000000
75%        1674.000000
max      231199.000000
Name: energy_100g, dtype: float64
========================================
count    169282.000000
mean         52.870583
std          34.739996
min        -800.000000
25%          19.000000
50%          53.000000
75%          88.000000
max         641.000000
Name: sum_elements, dtype: float64

Comments: We notice that there are some obvious outliers :

  • There are some negative values in the calculated energy.
  • Energy amount exceeding 3700kJ (the maximum energy amount a product can have is 3700kJ and it corresponds to 100% of fat)
  • _The sum of elements (fat_100g,carbohydrates_100g and proteins100g) is negative in certain case and also higher than 100

We can detect the entries that are wrong on the plot below :

In [42]:
df_nutrition = df_nutrition.loc[df_nutrition.reconstructed_energy<=3700]
df_nutrition = df_nutrition.loc[df_nutrition.reconstructed_energy>0]
df_nutrition = df_nutrition.loc[df_nutrition.sum_elements<=100]
df_nutrition = df_nutrition.loc[df_nutrition.sum_elements>0]
df_nutrition = df_nutrition.loc[df_nutrition.energy_100g>0]
df_nutrition = df_nutrition.loc[df_nutrition.energy_100g<=3700]
df_nutrition.head()
Out[42]:
product_name ingredients_text allergens energy_100g fat_100g carbohydrates_100g sugars_100g proteins_100g salt_100g allergens ingredients_from_palm_oil additives countries_en nutrition_grade_fr additives_tags sum_elements other_carbs reconstructed_energy
42 Banana Chips Sweetened (Whole) Bananas, vegetable oil (coconut oil, corn oil ... NaN 2243.0 28.57 64.29 14.29 3.57 0.00000 NaN NaN NaN United States NaN NaN 96.0 50.00 2210.71
43 Peanuts Peanuts, wheat flour, sugar, rice flour, tapio... NaN 1941.0 17.86 60.71 17.86 17.86 0.63500 NaN NaN NaN United States NaN NaN 96.0 42.85 1996.51
92 Organic Salted Nut Mix Organic hazelnuts, organic cashews, organic wa... NaN 2540.0 57.14 17.86 3.57 17.86 1.22428 NaN NaN NaN United States NaN NaN 93.0 14.29 2721.42
93 Organic Polenta Organic polenta NaN 1552.0 1.43 77.14 NaN 8.57 NaN NaN NaN NaN United States NaN NaN 87.0 NaN 1509.98
94 Breadshop Honey Gone Nuts Granola Rolled oats, grape concentrate, expeller press... NaN 1933.0 18.27 63.46 11.54 13.46 NaN NaN NaN NaN United States NaN NaN 95.0 51.92 1983.63
In [43]:
# Visualization of the errors
plt.figure(figsize=(10,5))
plt.title('Distributions of given and calculated energy of the products')
plt.scatter(df_nutrition["energy_100g"], df_nutrition["reconstructed_energy"])
plt.xlabel('Energy_100 present in the dataset')
plt.ylabel('Calculated energy')
plt.tight_layout()
plt.show()

Comments: The relation between the calculated energy is linearly correlated with the given energy.The cleaning process is succesful for this part.

B. Analysis and Data Visualization:

Comments:
The challenge is now to parse and extract useful information from this ingredients text fields. It looks like there is no particular format in the data, except that the ingredients are separated by commas. The first intuition is to take advantage of that, namely split the ingredients on the comma character, trim them, convert them lower case string and explode the resulting series. This already gives a decent result but many rows contain additional data such as quantity, and that affects our grouping function. It turns out using some regular expressions we can normalize most of the rows without much work.

In [44]:
# Pipeline to extract the ingredients - can be improved but already works quite well
ingredients = df_nutrition.ingredients_text.apply(lambda s: list(map(lambda t: t.strip(), re.sub("\(.+?\)", '', s.lower()).split(','))))
ingredients = ingredients.explode()
exploded = ingredients.str.lower().str.replace("^.*?[0-9]+[^ ]*? ", '').str.replace(" [0-9].*$", '').str.replace("([0-9]*%|\.|[_\(\)\:\*\[\]])", '').str.replace('^[0-9]+$', '').str.strip()
exploded = exploded.str.replace(' +', ' ').str.replace('œ', 'oe').str.replace('[éèêë]', 'e').str.replace('[àâä]', 'a').str.replace('[îï]', 'i').str.replace('[uûùü]', 'u').str.replace('[ôö]', 'o')
exploded = exploded[exploded.apply(lambda s: len(s) > 0)]

by_count = pd.Series(exploded).value_counts()
In [45]:
counts = pd.DataFrame()
counts['ingredient'] = list(by_count.index)
counts['apparition'] = list(by_count)
In [46]:
d = {}
counter = 0
for a in counts.ingredient:
    d[a] = counts.apparition[counter]
    counter = counter + 1
    
    
wordcloud = WordCloud(width = 1000, height = 500, background_color="white")
wordcloud.generate_from_frequencies(frequencies=d)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Comments:

With this first analysis, we can already see some obvious patterns emerging: without any surprise the three most common ingredients are salt, sugar and water.

Now we would like to plot our products on a 2D map grouped by clusters given their ingredients. This can be done using a dimensionality reduction algorithm. One of them is called t-SNE and can achieve this task fairly efficiently. The idea is to create a matrix of products which has 1k columns, one for each ingredient in the ranking by number of occurrence, and each cell containing a binary value describing if the ingredient is present or absent for this product. Ingredients beyond this ranking are irrelevant because they appear only so rarely, moreover it would make the algorithm unnecessarily slower.

In [47]:
# Number of ingredients (= dimensions) to consider
n_dimensions = 1000
# Number of rows (= datapoints) to consider at most
n_rows = 100000

# Minimum number of ingredients appearing in the list
min_ingredients = 3

relevant = by_count.head(n_dimensions).index.tolist()
products = exploded.iloc[df_nutrition.head(n_rows).index]

values = {}
for col in relevant:
    values[col] = products.apply(lambda s: 1 if s == col else 0).groupby(products.index).max()
matrix = pd.DataFrame(values)

matrix = matrix[matrix.sum(axis=1) >= min_ingredients]
In [48]:
matrix.head(5)
Out[48]:
salt sugar water citric acid corn syrup sea salt natural flavor spices high fructose corn syrup dextrose ... organic peanuts a b vitamin molasses powder bulgur wheat invertase cultured pasteurized nonfat milk organic diced tomatoes hot sauce red cabbage diced pears
115 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
117 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
118 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
126 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
128 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 1000 columns

Comments:
The matrix has the shape we want. Note that in the mean time we dropped rows with less than 3 ingredients in the matrix because they would create irrelevant datapoint.

We are now ready to apply the algorithm on the matrix (The execution takes some time).

In [49]:
X = matrix.head(10000)

n_components = 2 # Number of dimensions to reduce to
perplexity = 100 # t-SNE parameter (the higher the better the result, but also the more expensive)
tsne = manifold.TSNE(n_components=n_components, init='random', random_state=0, perplexity=perplexity)
Y = tsne.fit_transform(X)

Comments:
Here we plot some labels to roughly familiarize ourselves with the shape of our map. Here are some possible clusters that the algorithm might have identified:

  • Chocolate sweets
  • Fruit biscuits
  • Fishes
  • Cheese

We'll see later a robust method to confirm this intuition.

In [50]:
highlight(lambda s: False, 70, size=(15,10))

Comments:
In fact we want to make sure that the clustering is actually meaningful, one way to do that is to prove that the reduced dimensionality is correlated with an independent variable (= unused in the algorithm). One such variable would be the product name. Here we plot in red the products having chocolat in their name. Clearly the product names are correlated with the reduced dimensions. It also confirms the existence of one of the cluster we hypothesized earlier ("chocolate sweets").

In [51]:
highlight(lambda s: 'chocolate' in s.lower(), 0)

Comments:
Here are some other examples with more keywords:

In [52]:
highlight(lambda s: 'fruit' in s.lower(), 0)

Comments:
Let's highlight the ingredients individually to reason about our previous discoveries. Remark that we cannot make any conclusion about the position of the clusters in these plots because it is merely the result of a data representation algorithm. Simply put we only reordered the points so that products with similar ingredients fall close to each other.

In [53]:
highlight2(['salt'], 'Presence of salt in products')
In [54]:
highlight2(['sugar'], 'Presence of sugar in products')
In [55]:
highlight2(['water'], 'Presence of water in products')
In [56]:
highlight2(['palm oil'], 'Presence of palm oil in products')
In [57]:
# Size of the correlation matrix
n_correlations = 20

cmatrix = np.zeros((n_correlations, n_correlations))
columns = matrix.columns.tolist()[:n_correlations]

# https://en.wikipedia.org/wiki/Phi_coefficient
def phi_corr(n):
    top = n[1][1] * n[0][0] - n[1][0] * n[0][1]
    bot = math.sqrt((n[1][0] + n[1][1]) * (n[0][0] + n[0][1]) \
                    * (n[0][0] + n[1][0]) * (n[0][1] + n[1][1]))
    return top / bot
    
for i in range(n_correlations):
    a = columns[i]
    for j in range(i):
        b = columns[j]
        m = matrix[[a, b]]
        vs = [0, 1]
        result = np.zeros((len(vs), len(vs)))
        for x in vs:
            for y in vs:
                result[x][y] = len(m[(m[a] == x) & (m[b] == y)])

        cmatrix[i][j] = phi_corr(result)
In [58]:
mask = np.zeros_like(cmatrix, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

m = np.max(np.abs(cmatrix))

fig, ax = plt.subplots(figsize=(10,10))
ax.set_title('Correlations between the top %s ingredients' % n_correlations)
sns.heatmap(cmatrix, ax=ax, square=True, mask=mask, vmin=-m, vmax=m, xticklabels=columns, yticklabels=columns, cmap='coolwarm');

C. Additives Analysis:

Comments:

Now we will study some of the most important and trending topics in healthiness. In major health-related magazines, we usually talk about the following topics:

  • Additives
  • Allergens
  • Palm Oil

These are some negative aspects of our food consumption, and we are going to study these aspects in regard of the french food nutrition grade (given in many countries). Specifically, we are going to study the World and US consumptions concerning these aspects.

In this "additives" part, we are going to discuss the effect of the dangerous food additives on its nutrition grade, and consequently our everyday food consumption.
In fact, the nutrition grade is given in the range from "a" to "e" (from best grade to worst grade). But we want to know if the food products that have dangerous additives, or most additives will get good or bad nutrition grades. Additionally, we will compare these scores with the products that do not have any additives.

Hence, in order to determine which additives are bad we are going to first use the list in the "Hungry for Change" website. This list contains the most dangerous food additives that we can find on the market, but are not forbidden. This list was collected using different scientific studies on the matter.

Source: http://www.hungryforchange.tv/article/top-10-food-additives-to-avoid)

In [59]:
partb = pd.read_pickle('./dataset/partb.pkl')
additives_pickle = pd.read_pickle('./dataset/additives.pkl')
In [60]:
additives_new = additives_pickle.additives_tags.dropna().map(lambda x : x.lower())
additives_e250 = additives_new.str.contains('e250')
additives_nutrition = additives_pickle.nutrition_grade_fr
additives_both = additives_pickle.iloc[additives_e250.index]
In [61]:
additives_both[additives_both.nutrition_grade_fr.notnull()].sample(5)
Out[61]:
additives_tags additives_en nutrition_grade_fr
672136 en:e322 E322 - Lecithins e
335665 en:e14xx,en:e331,en:e440,en:e440i,en:e950,en:e955 E14XX - Modified Starch,E331 - Sodium citrates... b
640738 en:e322 E322 - Lecithins e
831794 en:e322,en:e322i E322 - Lecithins,E322i - Lecithin e
332746 en:e300 E300 - Ascorbic acid a
In [62]:
from scipy.stats import beta

order = ['a', 'b', 'c','d','e']

def exact_CI(x, N, alpha=0.95):
    x = float(x)
    N = float(N)
    p = round((x/N)*100,2)
    intervals = [round(i,4)*100 for i in beta.interval(alpha,x,N-x+1)]
    intervals.insert(0,p)
    result = {'Proportion': intervals[0], 'Lower CI': intervals[1], 'Upper CI': intervals[2]}
    return intervals

def grade_dist(title, df_dist, switch_CI=False):
    if (switch_CI == True):
        tmp_df = df_dist.apply(pd.value_counts).loc[order]
        tmp_values = tmp_df.values
        for i in range(len(tmp_values)):
            intervals = exact_CI(tmp_values[i], tmp_df.sum())
            print("The interval of confidence for the grade " + str(i) + " is between: " + str(intervals[1]) + " and " +
                                                                                            str(intervals[2]))
    
    fig = df_dist.apply(pd.value_counts).loc[order].plot(kind='bar', subplots=True, color = 'green')
    plt.rcParams['axes.facecolor'] = 'white'
    plt.title(title)
    plt.xlabel("Nutrition Grade")
    plt.ylabel("Number of products")
    plt.show();
In [63]:
additives_result_e250 = pd.DataFrame(additives_both.nutrition_grade_fr)
In [64]:
grade_dist("Grade Distribution for E250 additive in the World", additives_result_e250, True)
The interval of confidence for the grade 0 is between: 8.93 and 9.27
The interval of confidence for the grade 1 is between: 12.01 and 12.4
The interval of confidence for the grade 2 is between: 22.06 and 22.56
The interval of confidence for the grade 3 is between: 31.0 and 31.55
The interval of confidence for the grade 4 is between: 24.85 and 25.369999999999997

Comment:

Here we do find the e250 additive ( used for curing or preserving meat and fish products) which can cause acute methemoglobinemia (haemoglobin loses its ability to carry oxygen), irritability, lack of energy, headache, brain damages or even death in severe untreated cases. However, what is alarming is that we find many products which have this additive, but keep a "a" and "b" nutrition grade.

Now, we are going to do the same analysis on a the list given by the "Hungry for Change" website on the world dataset, and on the specific US dataset.

In [65]:
list_dangerous_additives = ['e951','e621', 'e133', 'e124','e110', 'e102', 'e221', 'e320', 'e220']
kwstr = '|'.join(list_dangerous_additives)
mask = additives_new.to_frame().stack().str.contains(kwstr).any(level=0)
df_additives = additives_new[mask]
additives_all = additives_pickle.iloc[df_additives.index]
In [66]:
additives_result_all = pd.DataFrame(additives_all.nutrition_grade_fr)
grade_dist("Grade Distribution for dangerous additives in the World", additives_result_all)
In [67]:
partb_new_null_additives = partb.additives_tags.isnull()
df_additives_null = partb.iloc[partb_new_null_additives.index]
df_result_additives_null = pd.DataFrame(df_additives_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Additives in the World", df_result_additives_null)

Comment:
As you can see from the two plots above, is that most of the products in the world do have a nutrition grade of "d", which is a pretty bad grade, but is not the worst grade.
Naturally, we will get less products with good grades ("a" and "b") in the grade distribution for additives, since products with additives should be more dangerous for our health.

In [68]:
partb.countries = partb.countries.str.lower()

# Fix some of the names with multiple entries (on ISO-apha 2 codes and country name)

partb.loc[partb['countries'] == 'en:fr','countries'] = 'france'
partb.loc[partb['countries'] == 'en:es','countries'] = 'spain'
partb.loc[partb['countries'] == 'en:gb','countries'] ='united kingdom'
partb.loc[partb['countries'] == 'en:uk','countries'] ='united kingdom'
partb.loc[partb['countries'] == 'españa','countries'] ='spain'
partb.loc[partb['countries'] == 'us','countries'] = 'united states'
partb.loc[partb['countries'] == 'en:us','countries'] ='united states'
partb.loc[partb['countries'] == 'usa','countries'] = 'united states'
partb.loc[partb['countries'] == 'en:cn','countries'] = 'canada'
partb.loc[partb['countries'] == 'canada','countries'] = 'canada'
partb.loc[partb['countries'] == 'en:au','countries'] = 'australia'
partb.loc[partb['countries'] == 'australia','countries'] = 'australia'
partb.loc[partb['countries'] == 'en:de','countries'] ='germany'
partb.loc[partb['countries'] == 'deutschland','countries'] ='germany'
partb.loc[partb['countries'] == 'en:be','countries'] ='belgium'
partb.loc[partb['countries'] == 'belgium','countries'] ='belgium'
partb.loc[partb['countries'] == 'en:ma','countries'] ='morocco'
partb.loc[partb['countries'] == 'morocco','countries'] ='morocco'
partb.loc[partb['countries'] == 'en:ch','countries'] ='switzerland'
partb.loc[partb['countries'] == 'switzerland','countries'] ='switzerland'
partb.loc[partb['countries'] == 'en:br','countries'] ='brazil'
partb.loc[partb['countries'] == 'brazil','countries'] ='brazil'
partb.loc[partb['countries'] == 'en:mx','countries'] ='mexico'
partb.loc[partb['countries'] == 'mexico','countries'] ='mexico'
partb.loc[partb['countries'] == 'en:ru','countries'] ='russia'
partb.loc[partb['countries'] == 'russia','countries'] ='russia'
partb.loc[partb['countries'] == 'en:dz','countries'] ='algeria'
partb.loc[partb['countries'] == 'algerie','countries'] ='algeria'
partb.loc[partb['countries'] == 'en:za','countries'] ='south africa'
partb.loc[partb['countries'] == 'south africa','countries'] ='south africa'
partb.loc[partb['countries'] == 'en:in','countries'] ='india'
partb.loc[partb['countries'] == 'india','countries'] ='india'
partb.loc[partb['countries'] == 'en:cn','countries'] ='china'
partb.loc[partb['countries'] == 'china','countries'] ='china'
In [69]:
## US plot ##

us = ['united states']
partb_us = partb[partb.countries.isin(us)]

partb_additives_us = partb_us.additives_tags.dropna().map(lambda x : x.lower())
additives_us = partb.iloc[partb_additives_us.index]

result_additives_us = pd.DataFrame(additives_us.nutrition_grade_fr)
grade_dist("Grade Distribution for dangerous additives in the US", result_additives_us)
In [70]:
partb_new_us_null_additives = partb_us.additives_tags.isnull()
df_us_additives_null = partb.iloc[partb_new_us_null_additives.index]
df_result_us_additives_null = pd.DataFrame(df_us_additives_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Additives in the US",df_result_us_additives_null)

Comment:
Regarding the US grade distribution for additives and non-additives products, we can highlight the fact that the US follow the same trends as the world when it concerns additives.
However, it is worth noticing that the second most frequent grade for "additives" products is "e" is the United States, and this is important because intuitively we should have more products in the right part of this distribution.

Now, we will study the distribution of those additives in many specific countries.

In [71]:
countries = ['france','united kingdom','spain','germany','united states','australia','canada', 'belgium', 'morocco', 'switzerland', 'brazil', 'mexico',
            'russia', 'algeria', 'south africa', 'china', 'india']
df_countries = partb[partb.countries.isin(countries)]
df_countries_additives = df_countries[df_countries.additives_n.notnull()]
df_groupedby_countries_additives = df_countries_additives.groupby(['countries']).mean().additives_n.reset_index()

np_countries_additives = np.array(df_groupedby_countries_additives)
np_countries_additives = np_countries_additives[np_countries_additives[:,1].argsort()[::-1]]

# Plot the average number of additives per country
fig = plt.figure(figsize=(23,8))
ax1 = fig.add_subplot(1,1,1)
y_pos = np.arange(len(np_countries_additives[:,0]))
x_pos = np_countries_additives[:,1]
x_ticks = np_countries_additives[:,0]

# Make a barplot
plt.bar(y_pos, x_pos, align='center')
plt.title('Average number of additives per product by country')
plt.xticks(y_pos, x_ticks)
plt.ylabel('Average number of additives') 
plt.show()
In [72]:
import plotly.express as px
import country_converter as coco
cc = coco.CountryConverter()

#we create here the ISO-3 country codes in order to plot our map in plotly
df_groupedby_countries_additives['iso_code'] = df_groupedby_countries_additives['countries'].apply(lambda x : coco.convert(names=x, to='ISO3'))

fig = px.choropleth(df_groupedby_countries_additives, locations="iso_code",
                    color="additives_n", # additives_n is a column of our dataframe
                    hover_name="countries", # column to add to hover information
                    color_continuous_scale=px.colors.sequential.Plasma)
fig.show()

Comments:

Here, we first ranked 17 countries from our list of countries in the dataset, in order to identify the countries which use the biggest average number of additivs per product. Normally, this ranking should reflect the food consumption habits and use of additives of its population.
It can be clearly seen that South Africa, Mexico, Belgium and Australia use the most additives out of these countries. This demonstrates some unhealthy additives habits of these countries but most importantly a lack of food regulation in some of these government administrations.

However, the countries that use the least number of additives are china and russia but we suspect that we have too little data in those countries, and this will badly influence our analysis.

Finally, we plotted a map in order to better visualize the differences between the countries from our study.

In [73]:
def prod_dist(title, feature, xlabel):
    plt.figure(figsize=(12,10))
    ax = partb[partb[feature] > 0.0][feature].value_counts().sort_index().plot(kind='bar',color='skyblue')
    plt.title(title,fontsize=14)
    plt.xlabel(xlabel)
    plt.ylabel('Number of Products')
    plt.show();

prod_dist('Distribution of the number of additives by products','additives_n', 'Additives')
In [74]:
def log_dist(title, feature, xlabel):
    plt.figure(figsize=(12,10))
    ax = partb[partb[feature] > 0.0][feature].value_counts().sort_index().plot(kind='line',color='skyblue')
    plt.title(title,fontsize=14)
    plt.xlabel(xlabel)
    plt.yscale("log")
    plt.ylabel('log-number of products')
    plt.show();

log_dist('Log-Distribution of the number of additives by products','additives_n', 'Additives')

Comment:

This first plot shows the distribution of the number of additives by products. It can be clearly seen that most of the products use 1 to 5 additives in their composition. However, what is astonishing is that we do find products with more than 15 additives in total ! Based on our research, the types of additives used are regulated and has to be approved prior to use. However, there is no limitations for the number of additives that can be added to food products.

Furthermore, we plotted a log distribution of the number of additives per products, and this clearly follows a power-law distribution !

In [75]:
import matplotlib.patches as mpatches

additives = (additives_pickle['additives_en'].str.extractall("(?P<Count>[E]\d\d\d\w?)"))
additives_count = additives.apply(pd.value_counts).head(20)
additives_count['additives_num'] = additives_count.index
additives_count.reset_index(drop=True,inplace=True)


additives_mapping = {'E330': 'black','E322':'red','E322i':'red','E101':'blue','E375':'black','E101i':'blue',
                    'E300':'yellow','E415':'red','E412':'black','E500':'black','E471':'red','E203':'green','E407':'red',
                    'E440':'red','E250':'green','E150a':'blue','E450':'black','E500i':'blue','E331':'black',
                     'E129':'black','E339':'black','E440i':'red','E160a':'blue','E270':'black','E102':'blue',
                     'E410':'red','E133':'blue','E341':'black','E428':'red','E621':'black','E202':'blue'}

additives_count['colors'] = additives_count['additives_num'].map(additives_mapping)

ax = additives_count.plot(x='additives_num',y='Count',kind='barh',color=additives_count['colors'],figsize=(15,10))
ax.invert_yaxis()
ax.legend().set_visible(False)
ax.set_title('20 most used additives in food products')

colors = mpatches.Patch(color='blue', label='food colouring')
others = mpatches.Patch(color='black', label='other')
emulsifiers = mpatches.Patch(color='red', label='emulsifiers')
antioxidant = mpatches.Patch(color='yellow', label='antioxidants')
preservatives = mpatches.Patch(color='green', label='food preservatives')

plt.legend(handles=[colors,others,emulsifiers,antioxidant,preservatives])
plt.xlabel('Number of products')
plt.ylabel('types of additives')
plt.show();

Comment:
This is the ranking of the 20 most used additives by number of products, and we distinguish the classes of additives.
In fact, this complete list of dangerous additives and their classes was taken from the World Health Organization and the FDA. And as you can see above, the most dangerous and popular additives in food products are "emulsifiers" (and "other") additives present in more than 100000 products in the dataset.

Sources:
https://www.fda.gov/food/food-additives-petitions/food-additive-status-list
https://www.who.int/en/news-room/fact-sheets/detail/food-additives

D. Allergens Analysis:

Comments:

In this part, we are going to study the impact of presence and absence of allergens on the nutrition grade.
More specifically, this part will also take into consideration the US-special case in order to compare it to the rest of the world.

In [76]:
df_us_allergens = partb_us[['nutrition_grade_fr', 'allergens']]
df_allergens_notnull = partb.allergens.dropna() 

print('the Shape of the US Dataframe : ', partb_us.shape)
print('Number of Allergens Descriptions in the whole Dataframe : ', len(df_allergens_notnull.index))
print('Number of Allergens Descriptions in the US Dataframe : ', len(df_us_allergens.allergens.dropna().index))
the Shape of the US Dataframe :  (176035, 11)
Number of Allergens Descriptions in the whole Dataframe :  82895
Number of Allergens Descriptions in the US Dataframe :  845
In [77]:
partb_new_allergens = partb.allergens.dropna().map(lambda x : x.lower())
df_allergens = partb.iloc[partb_new_allergens.index]
df_result_allergens = pd.DataFrame(df_allergens.nutrition_grade_fr)
grade_dist("Grade Distribution for Allergens in the World", df_result_allergens)
In [78]:
partb_new_null_allergens = partb.allergens.isnull()
df_allergens_null = partb.iloc[partb_new_null_allergens.index]
df_result_allergens_null = pd.DataFrame(df_allergens_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Allergens in the World", df_result_allergens_null)

Comments:

From the world dataset, we wanted to compare the grade distributions in regards to the presence and absence of allergens.
Here, we do not have a lot of differences between the two plots, but we can notice a slight difference on the grades "b" and "e" which are ranked higher in terms of frequencies compared to their equivalent in the non-allergens distribution.

In this part, we also see that for we have a higher distribution of allergens in the worst grades (similar to additives).

In [79]:
partb_new_allergens_us = partb_us.allergens.dropna().map(lambda x : x.lower())
df_us_allergens = partb.iloc[partb_new_allergens_us.index]
df_result_us_allergens = pd.DataFrame(df_us_allergens.nutrition_grade_fr)
grade_dist("Grade Distribution for Allergens in the US",df_result_us_allergens)
In [80]:
partb_new_us_null_allergens = partb_us.allergens.isnull()
df_us_allergens_null = partb.iloc[partb_new_us_null_allergens.index]
df_result_us_allergens_null = pd.DataFrame(df_us_allergens_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Allergens in the US",df_result_us_allergens_null)

Comment:

On the other hand, the distribution of the allergens (compared to the non-allergens) on the US market, show a total imbalance on the nutrition grade. In fact, we have the same number of products with grade "b" and grade "d". This imbalance can also be explained by the low number of products with "allergens" labels on them. (845 for the US, and 82895 for the rest of the world) Hence, we can conclude that the impact of the allergens on the US products cannot be proven.

E. Palm Oil Analysis:

Comments:
Finally, we are going to study the use of palm oil ingredients (or ingredients that may be from palm oil) on the food consumption.

In [81]:
palm_list = ['ingredients_from_palm_oil_n',
 'ingredients_from_palm_oil',
 'ingredients_from_palm_oil_tags',
 'ingredients_that_may_be_from_palm_oil_n',
 'ingredients_that_may_be_from_palm_oil',
 'ingredients_that_may_be_from_palm_oil_tags','nutrition_grade_fr']

df_palm = partb[palm_list]
df_us_palm = partb_us[palm_list]

print('the number of the Palm Oil products in the whole dataframe : ', len(df_palm.ingredients_from_palm_oil_n.dropna().index))
print('the number of the Palm Oil products in the US dataframe : ', len(df_us_palm.ingredients_from_palm_oil_n.dropna().index))
the number of the Palm Oil products in the whole dataframe :  447135
the number of the Palm Oil products in the US dataframe :  170906
In [82]:
prod_dist('Distribution of the number of Palm Oil Ingredients by products','ingredients_from_palm_oil_n', 'Palm oil Ingredients' )

Comments:
Here we wanted to see the distribution of the number of palm oil ingredients in some of our products (in the world dataset). Hence, most of the products only contain one palm oil ingredient.
Nonetheless, we do have food products with two or three ingredients containing palm oil, which is inevitably bad for your health.

In [83]:
palm_oil_tags_new = partb.ingredients_from_palm_oil_tags.dropna().map(lambda x : x.lower())
df_palm_ingredients = partb.iloc[palm_oil_tags_new.index]
df_result_palm = pd.DataFrame(df_palm_ingredients.nutrition_grade_fr)
grade_dist("Grade Distribution for Palm Oil Products in the World",df_result_palm )
In [84]:
partb_new_null_palm_oil_tags = partb.ingredients_from_palm_oil_tags.isnull()
df_palm_oil_tags_null = partb.iloc[partb_new_null_palm_oil_tags.index]
df_result_palm_oil_tags_null = pd.DataFrame(df_palm_oil_tags_null.nutrition_grade_fr)
grade_dist("Grade Distribution for Non-Palm Oil Products in the World", df_result_palm_oil_tags_null)

Comments:

Regarding the world dataset, it can clearly be seen the the grade distribution for palm oil ingredients is left-skewed and it is normal that products containing palm oil should have the worst nutrition grades.
It also means that palm oil ingredients have a really bad impact on a person's health, and the grade system displays it.

In [85]:
palm_oil_tags_new_us = partb_us.ingredients_from_palm_oil_tags.dropna().map(lambda x : x.lower())
df_palm_us_ingredients = partb.iloc[palm_oil_tags_new_us.index]
df_result_us_palm = pd.DataFrame(df_palm_us_ingredients.nutrition_grade_fr)
In [86]:
grade_dist("Grade Distribution for Palm Oil Products in the US", df_result_us_palm)
print('The number of products studied in the US Palm Oil Dataframe is only : ', len(df_palm_us_ingredients.index))
The number of products studied in the US Palm Oil Dataframe is only :  3
In [87]:
#create a column that will be used in order to compute the number of suspected palm oil ingredients in a product
df_countries['palm_oil_n'] = df_countries['ingredients_from_palm_oil_n'] + df_countries['ingredients_that_may_be_from_palm_oil_n']

df_countries_palm = df_countries[df_countries.palm_oil_n.notnull()]
df_groupedby_countries_palm = df_countries_palm.groupby(['countries']).mean().palm_oil_n.reset_index()

np_countries_palm = np.array(df_groupedby_countries_palm)
np_countries_palm = np_countries_palm[np_countries_palm[:,1].argsort()[::-1]]

# Plot the average number of additives per country
fig = plt.figure(figsize=(23,8))
ax1 = fig.add_subplot(1,1,1)
y_pos = np.arange(len(np_countries_palm[:,0]))
x_pos = np_countries_palm[:,1]
x_ticks = np_countries_palm[:,0]

# Make a barplot
plt.bar(y_pos, x_pos, align='center')
plt.title('Average number of Palm oil ingredients per product by country')
plt.xticks(y_pos, x_ticks)
plt.ylabel('Average number of Palm oil ingredients') 
plt.show()
In [88]:
import plotly.express as px
import country_converter as coco
cc = coco.CountryConverter()

#we create here the ISO-3 country codes in order to plot our map in plotly
df_groupedby_countries_palm['iso_code'] = df_groupedby_countries_palm['countries'].apply(lambda x : coco.convert(names=x, to='ISO3'))
#we also drop 
df_groupedby_countries_palm = df_groupedby_countries_palm[df_groupedby_countries_palm['palm_oil_n'] != 0]

fig = px.choropleth(df_groupedby_countries_palm, locations="iso_code",
                    color="palm_oil_n", # Palm_oil_n is a column of our dataframe
                    hover_name="countries", # column to add to hover information
                    color_continuous_scale=px.colors.sequential.algae)
fig.show()

Comments:

To conclude, we wanted to see the distribution of food products containing palm oil ingredients in regard to the nutrition grade.
Hence, with no surprise, we see that the number of products increase gradually with the negative grades (the most number of products have the worst grade which is "e") in the world dataset.

However, we cannot conclude anything the US dataset because we only have 3 products containing in their labels the "palm oil" tag. This can be explained by the fact that many US lobbies want their products to continue using palm oil without informing the public. (source "The Guardian")

Finally, an important result is the ranking of countries that have products suspected to use palm oil ingredients. The countries that use a lot of palm oil ingredients are algeria, south africa and belgium.
However, it is surprising to see countries such as australia, germany and united states at the bottom of this ranking. This can easily be explained by the fact that these countries have really "light" food regulations on palm oil (many of our sources confirm this statement).

Sources:
https://www.theguardian.com/news/2019/feb/19/palm-oil-ingredient-biscuits-shampoo-environmental
https://www.foodstandards.gov.au/consumer/generalissues/palmoil/Pages/default.aspx
https://mobil.wwf.de/fileadmin/fm-wwf/Publikationen-PDF/WWF_Report_Palm_Oil_-_Searching_for_Alternatives.pdf
https://www.alumniportal-deutschland.org/en/global-goals/sdg-12-consumption/palm-oil-rainforest-indonesia/
https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfcfr/CFRSearch.cfm?fr=101.4

F. Correlation Between Nutrition Grade (FR) and allergens, additives and palm oil ingredients:

In [89]:
sns.set(style="white")

#list of correlation elements that we will study
list_additives_palm = ['nutrition_grade_fr', 'additives_n', 'ingredients_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil_n' ]

#correlation matrix function specific to our study
def corr_matrix(df_test, title):
    fig = plt.figure(figsize=(20,20))
    df_test_new = df_test[list_additives_palm]
    df_test_new = pd.get_dummies(df_test_new, columns=['nutrition_grade_fr'])
    plt.matshow(df_test_new.corr())
    plt.title(title)
    plt.show();
In [90]:
corr_matrix(partb, 'Nutrition Grade Correlation Matrix on the World');
<Figure size 1440x1440 with 0 Axes>
In [91]:
corr_matrix(partb_us, 'Nutrition Grade Correlation Matrix on the World');
<Figure size 1440x1440 with 0 Axes>

Comments: Here we wanted to plot the correlation matrices regarding our findings in the 3 previous parts : we only find a slight correlation between additives and palm oil (0 and 2 columns).

PART 3 :                         Joining Dunnhumby and OpenFoodFacts

We proceeded to cleaning each one of the two datasets, then analyzed most of their features as well as their correlations. Now the two datasets are ready to be joined. these products. The idea is to complete the Dunnhumby dataset by adding to products purchased by households information about their ingredients and nutritive value, in order to evaluate the healthiness of the food product. Dunnhumby classifies food products at different levels, Subcommodity, commoditiy and Department. While OpenFoodFacts has a Product name attribute and a Category feature. Comparing these features we decided to merge the two datasets based on the subcommodity attribute for Dunnhumby and the product name for OpenFoodFacts. Indeed, these have the closest product classification level. However the labels are not exactly the same between Subcommodity and Product name. We need to implement a function that finds the product in OpenFoodFacts with the closest name to a subcommodity in Dunnhumby. For this we used changed the name of the product on both dataset by applying the same methodology :

  • First,we applied Casfolding (everthing lower-case) and tokenization this helped to have less sparsity between the datasets.

  • Then, we applied stopword removal of nltk for english, and to boost the performance of our analysis we went further and created a list of words that are usell for our tast (like: brands name,color,etc...)

  • Furthemeor, we applied lemmatization every word using WordNetLemmatizer of nltk library.

  • Finaly we used the Difflib function SequenceMatcher, which compares two strings and provides a ratio indicating the ‘closeness’ of the two names. Using a threshold method we keep the pairs found with a ratio above 0.7. We also add two rules in order to give priority to exact matches, and to the product names included in the subcommodity name or vis-versa.

We constructed a dictionary on products name matches between Dunnhumby (keys) and OpenFoodFacts(values). We can notice that we have a lot of exact matches, and a big number of good results. We are unable to build a complete dictionary which allows us to assign all the products present in Dunhmby to a product present in OpenFoodFacts. Only on third of the products listed in dunhmby have a match with a product in OpenFoodFacts.

We decided to keep the shown dictionary for this report,and exctract some meaningfull information.

In [92]:
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
import seaborn as sns
from sklearn import manifold, datasets
from adjustText import adjust_text
from difflib import SequenceMatcher
import gensim
from gensim.utils import *
from difflib import SequenceMatcher
import pandas as pd
import numpy as np
import csv
lemmatizer = nltk.stem.WordNetLemmatizer()
In [93]:
def lemmatize_text(txt):
    
    #Apply case folding
    txt = txt.lower()
    
    #Aplly tokenization
    txt = simple_preprocess(txt)
    
    #Lemmatization
    lemmatized_words = [lemmatizer.lemmatize(word) for word in txt]
    
    #Stopword removal
    stop_words = list(set(stopwords.words('english')))
    other_stop_words = ['natural', 'original', 'organic', 'except', 'variety', 'thai', 'heaven', 'divine', 'heavenly', 'divinely', 'refined', 'food', 'quick', 
               'style', 'store', 'best', 'company', 'magic', 'gourmet', 'guiltless', 'other', 'super', 'prem', 'mini', 'dark', 'aunt', 'tricolor', 'freshly', 'williams', 'lundberg', 
               'golden', 'olde', 'warner', 'world', 'hudson', 'ryan', 'emojeez', 'merry', 'christmas', 'yam', 'yan', 'grove', 'confectionery', 'co', 'halloween', 'st', 'pat', 
               'deluxe', 'premium', 'gold', 'best', 'petite', 'lindor', 'family', 'kid', 'sprungli', 'passion', 'delicious', 'hello', 'panda', 'special', 'large', 'party', 
                'mix', 'assorted', 'round', 'creation', 'covered', 'bite', 'lindt', 'cooked', 'island', 'icicle', 'fine', 'rocher', 'four' , 'psst', 'homestyle', 'real', 'finely', 
                'monterey', 'jack', 'imitation', 'vit', 'part', 'artic', 'blaster', 'sammies', 'havarti', 'zesty', 'traditional', 'old', 'fashioned', 'microwave', 'kettle', 'movie', 
                'theater', 'classic', 'add', 'piece', 'peeled', 'segment', 'truth', 'simple', 'product', 'harvest', 'handpicked', 'added', 'paul', 'mr', 
                'fisherman', 'alfredo', 'fettuccini', 'cellas', 'arborio' , 'naan', 'dahi', 'luxury', 'dreamhouse','harvest','calbee', 'acidophilus', 
                'probiotic','triple','cleaned', 'instant', 'homestyle', 'stadium', 'hawaiian', 'koepplinger', 'classic','woodfield', 'bunker','hill','mountain', 
                'frutel' ,'evaporated', 'added','opaa','conchita', 'squeeze' ,'cha','act','ii','kln','brand','happy','valentine','day','sampler','vitarroz',
                'tam','fun','tillamookies','richardson','mr','florentine','yamamotoyama', 'loretta','joseph','flax','ala','wan','ja','wanjashan','wismettac','free',
                'range','aa','froyo','garcia','boxed', 'baby', 'adult', 'child', 'assortment']
    stop_words.extend(other_stop_words)
    filtered_words = [word for word in lemmatized_words if word.lower() not in stop_words]
    
    #Remove redunduncy in the final product name 
    unique_words = list(set(filtered_words))
    txt = ' '.join(unique_words)
    
    return txt
In [94]:
def most_frequent(List): 
    counter = 0
    num = List[0] 
      
    for i in List: 
        curr_frequency = List.count(i) 
        if(curr_frequency> counter): 
            counter = curr_frequency 
            num = i 
  
    return num
In [95]:
def age_function(x):
    n = len(demographic.AGE_DESC.value_counts())
    values = demographic.AGE_DESC.value_counts().sort_index(ascending=True).index
    for i in range(3):
        for j in range(int(n/3)):
            if(i == 0 and x == values[j]):
                return '19-34'
            elif(i == 1 and x == values[j + int(n/3)]):
                return '35-54'
            elif(i == 2 and x == values[j + 2*int(n/3)]):
                return '55-65+'
In [96]:
def income_function(x):
    n = len(demographic.INCOME_DESC.value_counts())
    values = demographic.INCOME_DESC.value_counts().sort_index(ascending=True).index
    for i in range(4):
        for j in range(int(n/4)):
            if(i == 0 and x == values[j]):
                return '-15-34K'
            elif(i == 1 and x == values[j + int(n/4)]):
                return '035-99K'
            elif(i == 2 and x == values[j + 2*int(n/4)]):
                return '100-174K'
            elif(i == 2 and x == values[j + 3*int(n/4)]):
                return '175-250K+'
In [97]:
def grade_food(x):
    a = x.count('a')/len(x)
    b = x.count('b')/len(x)
    c = x.count('c')/len(x)
    d = x.count('d')/len(x)
    e = x.count('e')/len(x)
    counts = [a, b, c, d, e]
    labels = ['a','b','c','d','e']
    best = max(counts)
    
    return labels[counts.index(best)]

Handling text for Openfoodfacts:

In [98]:
openfoodfacts = pd.read_csv('dataset/OpenFoodFacts_processed.csv')

#Drop all the products in OpenFoodFacts with no nutrition grades
openfoodfacts = openfoodfacts.dropna(subset = ['nutrition_grade_fr'])

#Apply our text processing pipline 
openfoodfacts['openfoodfacts_comun_name'] = openfoodfacts.product_name.apply(lemmatize_text)

#In ordre to remove some reduncy in our Dataframe, we decided for each product name which is not unique to keep only the most
#frequent nutrition grade and the mean of all Macronutrients
nt_grade = openfoodfacts.groupby(['openfoodfacts_comun_name']).nutrition_grade_fr.apply(list).reset_index().rename(columns={'openfoodfacts_comun_name':'openfoodfacts_comun_name'})
nt_grade.nutrition_grade_fr = nt_grade.nutrition_grade_fr.apply(most_frequent)
openfoodfacts = openfoodfacts.drop(columns = ['nutrition_grade_fr'])
openfoodfacts_dict = pd.merge(nt_grade,openfoodfacts,on = 'openfoodfacts_comun_name')
macro = openfoodfacts_dict.groupby(['openfoodfacts_comun_name'])['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g'].median().reset_index().rename(columns={'openfoodfacts_comun_name':'openfoodfacts_comun_name'})
openfoodfacts_dict = openfoodfacts_dict.drop(columns = ['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g'])
openfoodfacts_dict = pd.merge(macro,openfoodfacts_dict,on = 'openfoodfacts_comun_name')
openfoodfacts_dict = openfoodfacts_dict.drop_duplicates(subset=['openfoodfacts_comun_name'], keep='last')
openfoodfacts_dict = openfoodfacts_dict.drop(openfoodfacts_dict[openfoodfacts_dict.openfoodfacts_comun_name.apply(lambda x: x=='')].index)

#Drop all the columns that are not interrsting for our analysis
openfoodfacts_dict = openfoodfacts_dict.drop(columns = ['allergens','allergens.1', 'ingredients_from_palm_oil', 'additives','countries_en', 'additives_tags', 'sum_elements', 'other_carbs',
                                               'reconstructed_energy'])

Handling text for Dunhumby:

In [99]:
demographic = pd.read_csv('dataset/hh_demographic.csv', delimiter=',')
demographic.INCOME_DESC = demographic.INCOME_DESC.apply(correct_indecome)
demographic.MARITAL_STATUS_CODE = demographic.MARITAL_STATUS_CODE.apply(correct_marital_status)
demographic.MARITAL_STATUS_CODE = demographic.apply(lambda row: replace_unknowns(row['HH_COMP_DESC'],
                                                    row['MARITAL_STATUS_CODE'], row['HOUSEHOLD_SIZE_DESC']),axis=1)
demographic['categorical_age'] = demographic.AGE_DESC.apply(age_function)
demographic['categorical_income'] = demographic.INCOME_DESC.apply(income_function)
transaction = pd.read_csv('dataset/transaction_data.csv', delimiter=',')
demographic_households = demographic.household_key.unique()
transaction = transaction[transaction.household_key.apply(lambda x: x in demographic_households)]
product = pd.read_csv('dataset/product.csv', delimiter=',')
list_food = ['GROCERY','PASTERY','PRODUCE','NUTRITION','MEAT','FROZEN GROCERY','SALAD BAR','SEAFOOD','SPIRITS','SEAFOOD-PCKGD','MEAT-PCKG','DELI'] 
food_related_products = product[product.DEPARTMENT.apply(lambda x: x in list_food)]
food_related_products.drop(columns = ['BRAND','CURR_SIZE_OF_PRODUCT','MANUFACTURER'],inplace = True)
consumution = transaction.drop(columns = ['STORE_ID','RETAIL_DISC','TRANS_TIME','COUPON_DISC','COUPON_MATCH_DISC','BASKET_ID','DAY','WEEK_NO'])
In [100]:
#Apply our text processing pipline  our the product dataframe (for efficiency)
food_related_products['dunhumby_comun_name'] = food_related_products.SUB_COMMODITY_DESC.apply(lemmatize_text)
food_related_products = food_related_products.drop(food_related_products[food_related_products.dunhumby_comun_name.apply(lambda x: x=='')].index)
food_related_products = food_related_products.drop_duplicates(subset=['dunhumby_comun_name'], keep='last')

#Join product dataframe with demographic and transcation dataframe
inter = pd.merge(food_related_products,consumution,on =  'PRODUCT_ID')
full_data_set = pd.merge(inter,demographic,on = 'household_key')
In [ ]:
#Function to creat our dictionnary
flag = False
dic = {}
for i in food_related_products.dunhumby_comun_name:
    print(i)
    d1 = {}
    d2 = {}
    flag = False
    flag_in = False
    for j in openfoodfacts_dict.openfoodfacts_comun_name:
        #Condition if both word are the same
        if i == j:
            dic[i] = j
            flag = False
            flag_in = False
            break
        #Condition if one word is contained in the other word    
        elif (i in j) or (j in i):
            s = SequenceMatcher(None, i, j)
            d2[j] = s.ratio()
            flag_in = True
        else:
            s = SequenceMatcher(None, i, j)
            d1[j] = s.ratio()
            flag = True
    #Extract the best score comparing the the previous conditions
    if flag_in:
        m2 = max(d2.values())
        key_list2 = list(d2.keys()) 
        val_list2 = list(d2.values())
        if flag:
            m1 = max(d1.values())
            key_list1 = list(d1.keys()) 
            val_list1 = list(d1.values()) 
            if m2>=m1 and m2>=0.7:
                print('dunhumby : ', i,', openfoodfacts : ', key_list2[val_list2.index(m2)],', ratio :', m2, ', in')
                dic[i] = key_list2[val_list2.index(m2)]
            else:
                if m1 >= 0.75:
                    print('dunhumby : ', i,', openfoodfacts : ', key_list1[val_list1.index(m1)],', ratio :', m1)
                    dic[i] = key_list1[val_list1.index(m1)]
        else:
            print('dunhumby : ', i,', openfoodfacts : ', key_list2[val_list2.index(m2)],', ratio :', m2, ', in')
            dic[i] = key_list2[val_list2.index(m2)]
    elif flag:
        m = max(d1.values())
        key_list = list(d1.keys()) 
        val_list = list(d1.values())
        if m>=0.75:
            print('dunhumby : ', i,', openfoodfacts : ', key_list[val_list.index(m)],', ratio :', m)
            dic[i] = key_list[val_list.index(m)]
In [ ]:
w = csv.writer(open("merge.csv", "w+"))
for key, val in dic.items():
    w.writerow([key, val])
w.close()
In [101]:
#The dictionary is created once, du random lemmatization
dictionaire =  pd.read_csv('dataset/merge.csv',header=None)

openfoodfacts_dict = pd.read_csv('dataset/dictionaire.csv',header=None)

dictionaire.columns = ['dunhumby_comun_name','openfoodfacts_comun_name']

dictionaire =  pd.read_csv('dataset/merge.csv',header=None)
dictionaire.columns = ['dunhumby_comun_name','openfoodfacts_comun_name']

inter_medaire =  pd.read_csv('dataset/inter_medaire.csv')
grades =  pd.read_csv('dataset/grades.csv')

Comments:
From the plot, we can conclude that our dictionary has kept products that are consumed in our daily life, without significantly changes in the order of the most consumed products.

In [102]:
import matplotlib.pyplot as plt
plt.figure(num = None, figsize = (10,10))
ax = grades.dunhumby_comun_name.value_counts()[:10].plot.bar()
ax.title.set_text('Most frequent product')
plt.show()

Comments:
An important step before any analysis is the check if the amount of producs present in the dataset for each nutrion grade (to look for balanced or unbalanced categories). We can clearly observe that the dataset is slighty unbalanced in favor of nutrition grade 'd'.

In [103]:
plt.figure(num = None, figsize = (10,10))
ax = inter_medaire.nutrition_grade_fr.value_counts().sort_index().plot.bar()
ax.title.set_text('Most frequent product')
plt.show()

Comments:
First let us start with the impact of the income on the nutrition grade. Surprisingly, the results are in accordance with our hypothesis. Households with an income between 15K and 50K dollars tend to buy food products with low nutrition grade. As the income grows higher, the nutrition grade of consumed foods improves. For incomes between 175K and 250K dollars, we can see that the most frequent nutrition grade of consumed food products is very high. This shows that income has a direct relationship with how healthy a household eats. Wealthy households seem to eat a lot healthier than others with lower income. However, we can observe that for the super rich ones, the nutrition grade drops dramatically.Let us now examine these observations in details. Considering the nutrition grade only is not completely representative of these influences. We complete it by looking at the food composition..

In [104]:
#Compute the most freqente nutrition grade consumed by each income category
categorical_income_nutrition = grades.groupby(['INCOME_DESC']).nutrition_grade_fr.apply(list).reset_index().rename(columns={'categorical_income':'categorical_income'})
categorical_income_nutrition.nutrition_grade_fr = categorical_income_nutrition.nutrition_grade_fr.apply(grade_food)
categorical_income_nutrition
Out[104]:
INCOME_DESC nutrition_grade_fr
0 -15K d
1 015-024K d
2 025-034K d
3 035-049K d
4 050-074K c
5 075-099K c
6 100-124K c
7 125-149K b
8 150-174K c
9 175-199K a
10 200-249K a
11 250K+ d
In [105]:
categorical_income_macro = grades.groupby(['INCOME_DESC','household_key'])['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g'].median().reset_index().rename(columns={'openfoodfacts_comun_name':'openfoodfacts_comun_name'})
categorical_income_macro = categorical_income_macro.groupby(['INCOME_DESC'])['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g'].mean().reset_index().rename(columns={'openfoodfacts_comun_name':'openfoodfacts_comun_name'})
categorical_income_macro
Out[105]:
INCOME_DESC energy_100g fat_100g carbohydrates_100g sugars_100g proteins_100g salt_100g
0 -15K 842.548913 7.869511 23.349783 11.775054 5.435217 0.652618
1 015-024K 1071.650000 10.638900 32.131000 13.617550 6.768700 0.679427
2 025-034K 1230.254717 17.504953 25.099764 13.281840 6.389623 0.732904
3 035-049K 1117.080435 15.474761 24.174783 10.392130 6.101587 0.700916
4 050-074K 982.591870 10.527249 26.421108 12.274573 5.126126 0.489663
5 075-099K 1106.778169 12.460246 27.683345 11.940282 5.503451 0.493753
6 100-124K 973.943182 7.712273 32.231591 14.093068 4.479318 0.522473
7 125-149K 1126.163793 13.342586 27.271379 11.143621 5.691983 64.828846
8 150-174K 892.875000 6.932000 26.726875 17.268958 3.766563 0.373243
9 175-199K 989.027778 8.379444 27.714167 16.010556 5.996667 0.292947
10 200-249K 640.250000 3.803750 11.252500 2.170000 2.445000 0.408940
11 250K+ 1009.600000 4.855500 39.497000 14.611750 7.127000 0.617836

Comments:
Let's analyse the distribution of macro for each categorical income. Low income households seem to consume more fat and proteins than others, they also seem to pick in general high energetic products. Wealthy households, in contrast, has a much more balanced diet with more or less equal moderate quantities of fat, carbohydrates and proteins, and choose products with lower energy value.

In [106]:
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
categorical_income_macro_scaled =  pd.DataFrame(scaler.fit_transform(categorical_income_macro[['energy_100g','fat_100g',
                                                                             'carbohydrates_100g','proteins_100g']]),columns = ['energy_100g','fat_100g',
                                                                             'carbohydrates_100g','proteins_100g'])

categories=list(categorical_income_macro_scaled.columns)
N = len(categories)
 
# What will be the angle of each axis in the plot? (we divide the plot / number of variable)
angles = [n / float(N) * 2 *np. pi for n in range(N)]
angles += angles[:1]

ax = plt.subplot(111, polar=True,autoscale_on = True)
 
ax.set_theta_offset(0)
ax.set_theta_direction(-1)
plt.xticks(angles[:-1], ['Ration energy','Ration fat','Ration carbohydrates','Ration proteins'])
ax.set_rlabel_position(0)
plt.yticks([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0], ['0.1','0.2','0.3','0.4','0.5','0.6','0.7','0.8','0.9','1.0'], color="grey", size=7)
plt.ylim(0,1)

labels = categorical_income_macro.INCOME_DESC.values

# Ind1
for i in range(12):
    values= categorical_income_macro_scaled.iloc[i][:].values.flatten().tolist()
    values += values[:1]
    ax.plot(angles, values, linewidth=1, linestyle='solid', label= labels[i])
    ax.fill(angles, values, 'b', alpha=0.1)
    
# Add legend
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
Out[106]:
<matplotlib.legend.Legend at 0x23ae8fb8a48>

Comments:
We tehn compared the influcence of the age on the nutrition grade. Surprisingly, the resultas are mixed this time, there is no real trends. The different age categories have the consumes mostly products with a nutrition grade of 'd', except the categories (65+ and 45-54).Considering the nutrition grade only is not completely representative of these influences. We complete it by looking at the food composition.

In [107]:
#Compute the most freqente nutrition grade consumed by each age category
categorical_age_nutrition = grades.groupby(['AGE_DESC']).nutrition_grade_fr.apply(list).reset_index().rename(columns={'categorical_income':'categorical_income'})
categorical_age_nutrition.nutrition_grade_fr = categorical_age_nutrition.nutrition_grade_fr.apply(grade_food)
categorical_age_nutrition
Out[107]:
AGE_DESC nutrition_grade_fr
0 19-24 d
1 25-34 d
2 35-44 d
3 45-54 c
4 55-64 d
5 65+ c

Comments:
If we do the same for the age category, we see that young households tend to consume more fats and sugars (carbohydrates) as well as high energetic products. Middle aged households seem to have a well balanced nutrition, with low fats and low sugar. Also the food products they choose have low energetic value. Seniors, on their side, consume very low sugar, but have a diet with a lot proteins and fat, and the products they buy are very energetic.

In [ ]:
categorical_age_macro = grades.groupby(['AGE_DESC','household_key'])['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g'].median().reset_index().rename(columns={'AGE_DESC':'AGE_DESC'})
categorical_age_macro  = categorical_age_macro .groupby(['AGE_DESC'])['energy_100g','fat_100g','carbohydrates_100g','sugars_100g','proteins_100g','salt_100g'].mean().reset_index().rename(columns={'AGE_DESC':'AGE_DESC'})
categorical_age_macro 
In [109]:
categorical_age_macro_scaled =  pd.DataFrame(scaler.fit_transform(categorical_age_macro[['energy_100g','fat_100g',
                                                                             'carbohydrates_100g','proteins_100g']]),columns = ['energy_100g','fat_100g',
                                                                             'carbohydrates_100g','proteins_100g'])
categories=list(categorical_age_macro_scaled.columns)
N = len(categories)
 
# What will be the angle of each axis in the plot? (we divide the plot / number of variable)
angles = [n / float(N) * 2 *np. pi for n in range(N)]
angles += angles[:1]

ax = plt.subplot(111, polar=True,autoscale_on = True)
 
ax.set_theta_offset(0)
ax.set_theta_direction(-1)
plt.xticks(angles[:-1], ['Ration energy','Ration fat','Ration carbohydrates','Ration proteins'])
ax.set_rlabel_position(0)
plt.yticks([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0], ['0.1','0.2','0.3','0.4','0.5','0.6','0.7','0.8','0.9','1.0'], color="grey", size=7)
plt.ylim(0,1)
labels = categorical_age_macro.AGE_DESC.values
for i in range(6):
    values=categorical_age_macro_scaled.iloc[i][:].values.flatten().tolist()
    values += values[:1]
    ax.plot(angles, values, linewidth=1, linestyle='solid', label=labels[i])
    ax.fill(angles, values, 'b', alpha=0.1)
    
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
Out[109]:
<matplotlib.legend.Legend at 0x23ae8fccd88>

Comments:
As we expected, we could observe that there is a direct link between social status and the healthiness of consumed food. But, we would like to know if these observations can be statistically confirmed. We take a look at the correlations between income, age, food prices and nutrition grade. We see that there are some slight correlations between high income and high nutrition grades, middle incomes seem to be slightly correlated as well with middle nutrition grades, while low incomes are correlated with low nutrition grades. For the age, the correlation with nutrition grade seems to be linear, meaning that young age categories are slightly correlated with low nutrition grades, and the older the households are, the higher is the correlation with a better nutrition grade.

In [110]:
one_hot_income = pd.get_dummies(grades, columns=["AGE_DESC",'INCOME_DESC','nutrition_grade_fr'])
corr = one_hot_income.corr(method = 'pearson')

mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
fig, ax = plt.subplots(figsize=(10,10))         # Sample figsize in inches
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

Conclusion:

Food consumption is at the heart of many debates : the human race has always centered its attention toward this basic need which defines us.
During this study, we finally established a connection between social status and food consumption. There is now a clear common trend between wealthiness and healthiness.
But this trend truly highlights the fact that cultures from around the world and especially the United States will share common food consumption trends in regards to this perpetual need of health, which means : less additives, less palm oil... less excess to sum it up!

For further work, it would be great to have access to a database similar to dunnhumby but applied to more retailers from around the world. Naturally, we also want more data in the Open-Food-Facts Database in order to apply some of our previous analysis techniques on many more countries previously ignored because of a lack of data.
Most importantly, some fresh ideas could be applied in this study, by finding a more "social" database that could identify patterns between any social network (friends, family, political orientation, religion etc...) and our daily food consumption in order to have a more comprehensive and broad point of view !

In [ ]: